How to retrieve in this scenario?

I have daily news articles scraped from a Custom Search Engine for my knowledge base. I run searches on more than 60 topics.

In my local chatbot app I ask questions like: “What is the latest on Intel?”

What is the best retriever to use in this kind of scenario?

Normal semantic search does not return relevant chunks in the top_k.


Sounds like you’re building a personal RAG system and looking for the best retriever for a knowledge base filled with scraped news articles. I actually worked on a similar project not long ago, and we wrote a blog about it that might help you out. It covers how we built the system, the improvements we made, and the results we got (pay attention to the sections Contextualized Chunks and Information Retrieval techniques):
👉 https://www.ridgerun.ai/post/on-premise-retrieval-augmented-generation-system-how-we-designed-and-implemented-a-rag-for-ridgerun

In our case, we started with standard semantic search, but it wasn’t enough to get the most relevant chunks at the top. What really helped was adding a re-ranking step using ColBERT, plus tuning how we chunked the documents.
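To give a rough idea of what a ColBERT-style re-ranking step does, here is a minimal sketch of late-interaction (MaxSim) scoring over the candidates returned by the first-stage search. This is not our production code: the function names `maxsim_score` and `rerank` are hypothetical, and the two-dimensional vectors are toy stand-ins for the per-token embeddings that a real ColBERT model would produce.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product; equals cosine similarity if vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query token vector, take the best
    matching document token vector, then sum those maxima."""
    if not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rerank(
    query_vecs: List[Vector],
    candidates: List[Tuple[str, List[Vector]]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Re-score first-stage candidates and keep the top_k best."""
    scored = [(cid, maxsim_score(query_vecs, vecs)) for cid, vecs in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data (assumption): one vector per token, two query tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("chunk_a", [[1.0, 0.0], [0.5, 0.5]]),
    ("chunk_b", [[1.0, 0.0], [0.0, 1.0]]),
    ("chunk_c", [[0.1, 0.1]]),
]
ranked = rerank(query, candidates, top_k=2)
print(ranked)
```

The usual pattern is to pull a generous candidate set from semantic search (say the top 50) and only then re-rank down to the handful of chunks you pass to the LLM, since the re-ranker is more expensive per chunk.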

Also, just in case you haven’t already done this, I’d suggest reviewing how you generate your chunks. Scraping often brings in a lot of noise or extra characters that can mess up the embeddings and hurt retrieval quality.
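As an illustration of the kind of cleanup I mean, here is a small sketch that uses only the Python standard library. The function name `clean_scraped_text` and the sample string are made up for the example, and the regex-based tag stripping is a simplification; for real pages a proper HTML parser is usually the safer choice.

```python
import html
import re
import unicodedata


def clean_scraped_text(raw: str) -> str:
    """Normalize scraped text before chunking and embedding (rough sketch)."""
    # Drop leftover HTML tags (crude; a real HTML parser is more robust).
    text = re.sub(r"<[^>]+>", " ", raw)
    # Decode entities such as &amp; and &nbsp;.
    text = html.unescape(text)
    # Fold compatibility characters, e.g. non-breaking space to plain space.
    text = unicodedata.normalize("NFKC", text)
    # Remove invisible/control characters, keeping newlines and tabs for now.
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    # Collapse every run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Hypothetical scraped snippet with tags, entities and a zero-width space.
raw = "<p>Intel&nbsp;shares &amp; outlook</p>\n\n  <b>rise</b>\u200b today"
print(clean_scraped_text(raw))
```

Running the cleanup before chunking matters because stray markup and invisible characters end up inside the chunks and get embedded along with the actual content.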

There’s no one-size-fits-all retriever, so depending on how your data is structured, some methods might work better than others.

Hope this helps, and feel free to reach out if you want to dig into the details.


Adrian Araya
Machine Learning Engineer at RidgeRun.ai
Contact us: support@ridgerun.ai
